Upgrade AI PR reviewer model: gpt-5.4 → gpt-5.5 (CI + local) by igerber · Pull Request #522 · igerber/diff-diff

igerber · 2026-06-01T16:40:41Z

What

Upgrades the Codex PR-reviewer model from gpt-5.4 → gpt-5.5 for both the CI reviewer and the local /ai-review-local default (codex backend; effort: xhigh / read-only sandbox unchanged).

.github/workflows/ai_pr_review.yml — CI openai/codex-action@v1 model: gpt-5.5
.claude/scripts/openai_review.py — DEFAULT_MODEL = "gpt-5.5"; gpt-5.5 added to _is_reasoning_model
.claude/commands/ai-review-local.md — default-model + reasoning-model doc references
CHANGELOG.md — [Unreleased] entry

No PRICING entry: the reviewer runs gpt-5.5 via the codex backend (subscription / flat-rate), which doesn't consult the api-backend cost table. gpt-5.4 remains accepted.

Why — validated before the swap

Decided empirically, not by assumption (the prior upgrade regressed by missing real issues while changing model + prompt at once):

A/B harness (tools/reviewer-eval/): on a real-bug corpus plus a k=6 big-diff de-risk run, gpt-5.5 matched or beat gpt-5.4 on every test-backed recall case, including a bug buried in a ~3k-line methodology diff, with zero new false positives and lower latency. (The one early apparent divergence was traced to a weak, non-test-backed ground truth, i.e. noise.)
End-to-end CI canary: a same-repo PR carrying this model change + a planted P0 confirmed the real CI environment (openai/codex-action@v1, bundled codex CLI v0.135.0) runs gpt-5.5 (verified from the run-log model banner) and caught the planted P0.

Notes for review

Needs the ready-for-ci label to run the test matrix and become mergeable (CI Gate).
This PR's own AI review will run under gpt-5.5 (the PR-head workflow), against a clean off-current-main diff — a bonus self-check.

- .github/workflows/ai_pr_review.yml: CI Codex reviewer model -> gpt-5.5 (effort xhigh / read-only sandbox unchanged)
- .claude/scripts/openai_review.py: DEFAULT_MODEL -> gpt-5.5; add gpt-5.5 to _is_reasoning_model (api-backend timeout/token classification only)
- .claude/commands/ai-review-local.md: update default-model + reasoning-model references
- CHANGELOG: [Unreleased] entry

No PRICING entry: CI + local reviewer run gpt-5.5 via the codex backend (subscription/flat-rate), which does not consult the api-backend cost table.

Validated before the swap via tools/reviewer-eval/ A/B (gpt-5.5 >= gpt-5.4 on every test-backed recall case incl. a bug buried in a ~3k-line diff, 0 new false positives, faster) and an end-to-end CI canary (action CLI 0.135.0 runs gpt-5.5 and caught a planted P0). gpt-5.4 remains accepted.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

github-actions · 2026-06-01T16:44:43Z

Overall Assessment

✅ Looks good — no unmitigated P0/P1 findings. The PR does not touch estimator methodology or statistical outputs. One P2 cost-visibility/documentation issue should be fixed, but it is not a blocker.

Executive Summary

No causal estimator, weighting, variance/SE, inference, or identification behavior is changed.
The gpt-5.5 model update is propagated through CI and the local script default.
_is_reasoning_model() now correctly treats gpt-5.5 as a reasoning model.
OpenAI docs list gpt-5.5 as a Responses-capable reasoning model with xhigh support and $5/$30 per 1M input/output token pricing. (developers.openai.com)
P2: the API-backend cost estimate path was not updated for the new default model, so metered local API runs lose cost visibility.
I could not run pytest because pytest is not installed in this environment; I did run a direct import smoke check.

Methodology

No findings.

The diff only changes reviewer tooling/docs/workflow/changelog files. No estimator implementation, math, weighting, variance/SE, inference fields, assumptions, or default statistical behavior changed, so docs/methodology/REGISTRY.md does not identify an affected causal method.

Code Quality

Finding 1 — P2: API backend loses cost visibility for the new default model

Location: .claude/scripts/openai_review.py:L871-L878, .claude/scripts/openai_review.py:L2013-L2023, .claude/scripts/openai_review.py:L2135-L2151
Impact: DEFAULT_MODEL is now gpt-5.5, and auto can fall back to the API backend when Codex is unavailable. But PRICING still has only gpt-5.4 / gpt-5.4-pro for GPT-5.x, so estimate_cost(..., "gpt-5.5") returns None; the script then silently omits both estimated and actual cost lines for metered API runs. This contradicts the local docs’ cost-visibility promise and matters because OpenAI lists gpt-5.5 API pricing separately from gpt-5.4. (developers.openai.com)
Concrete fix: Add "gpt-5.5": (5.00, 30.00) to PRICING, and add/adjust tests for estimate_cost(..., "gpt-5.5") and estimate_cost(..., "gpt-5.5-2026-04-23").
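A minimal sketch of the fix described in this finding, assuming PRICING maps model names to (input, output) USD rates per 1M tokens and that dated snapshot names inherit the rate of the longest table key that prefixes them. The `estimate_cost` signature and the fallback rule are assumptions made for illustration, since the review elides the real arguments; only the $5/$30 rate comes from the finding.

```python
from typing import Optional

# USD per 1M tokens as (input, output). Only the gpt-5.5 rate quoted in the
# finding is listed here; the real table holds other models as well.
PRICING: dict[str, tuple[float, float]] = {
    "gpt-5.5": (5.00, 30.00),
}


def estimate_cost(input_tokens: int, output_tokens: int, model: str) -> Optional[float]:
    """Return the estimated USD cost, or None when the model is unpriced.

    Hypothetical signature: the real function's arguments are not shown
    in the review.
    """
    rates = PRICING.get(model)
    if rates is None:
        # Assumed fallback: a dated snapshot such as "gpt-5.5-2026-04-23"
        # inherits the rate of the longest table key that prefixes it.
        matches = [key for key in PRICING if model.startswith(key + "-")]
        if not matches:
            return None
        rates = PRICING[max(matches, key=len)]
    input_rate, output_rate = rates
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000


if __name__ == "__main__":
    print(estimate_cost(100_000, 10_000, "gpt-5.5"))
    print(estimate_cost(100_000, 10_000, "gpt-5.5-2026-04-23"))
    print(estimate_cost(100_000, 10_000, "some-unpriced-model"))
```

Under this assumed fallback rule, any name of the form gpt-5.5-... that lacks its own table entry is billed at the standard gpt-5.5 rate, so a differently priced variant needs an explicit key.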

Performance

No findings.

Maintainability

No findings. The model default and reasoning-model handling are narrowly scoped and consistent across the changed script/workflow paths.

Tech Debt

Finding 2 — P3 informational: workflow contract coverage remains tracked

Location: TODO.md:L177, TODO.md:L204
Impact: The AI review workflow contract is still only partially pinned by tests, which is relevant because this PR changes the workflow model line. This is already tracked under “Tech Debt from Code Reviews,” so it is not a blocker.
Concrete fix: No PR-blocking action required. Eventually implement the tracked workflow-contract test expansion.

Security

No findings. The workflow continues to use ${{ secrets.OPENAI_API_KEY }} and does not introduce new secret exposure.

Documentation/Tests

Finding 3 — P3: local docs understate API-backend cost after the model change

Location: .claude/commands/ai-review-local.md:L17-L20, .claude/commands/ai-review-local.md:L582-L585, CHANGELOG.md:L13-L14
Impact: The docs now label the API backend as gpt-5.5 but leave the old $0.05-0.50/run range and say cost estimates are shown. With gpt-5.5 absent from PRICING, the estimate is not shown; with official gpt-5.5 pricing, the old range can also understate max-cost runs.
Concrete fix: Update the API cost range or remove the fixed range, and align the changelog note with actual behavior: local auto can still use the API backend unless Codex is installed and authenticated.

Verification

Direct import smoke check: DEFAULT_MODEL == "gpt-5.5", _is_reasoning_model("gpt-5.5") is True, _resolve_timeout(None, "gpt-5.5") == 900, and estimate_cost(..., "gpt-5.5") is None.
pytest was not available, so I could not run the targeted test suite.

Add gpt-5.5 ($5/$30) and gpt-5.5-pro ($30/$180) to PRICING at OpenAI's confirmed standard rates (developers.openai.com/api/docs/pricing).

The production reviewer uses the flat-rate codex backend, but `--backend auto` falls back to the metered API path when the codex CLI is unavailable. There, estimate_cost("gpt-5.5") previously returned None and silently dropped the cost lines. gpt-5.5-pro gets its own entry so it doesn't prefix-fall-back to the standard rate.

Also bumps the api-backend cost range note (gpt-5.5 ~2x gpt-5.4) and the CHANGELOG wording.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

github-actions · 2026-06-01T17:01:53Z

🔁 AI review rerun (requested by @igerber)

Head SHA: e05a104b67a25918ef501bfa189e91171a3193a3

Overall Assessment

✅ Looks good — no unmitigated P0/P1 findings. The previous cost-visibility issue is substantially addressed, and the PR does not touch causal estimator methodology or statistical output.

Executive Summary

No estimator, weighting, variance/SE, identification, or causal-method default behavior changed.
gpt-5.5 is propagated through CI, local script default, reasoning-model detection, docs, and API pricing.
Previous P2 cost-visibility finding is resolved for standard gpt-5.5 / gpt-5.5-pro API pricing.
Residual P2: estimate_cost() still treats gpt-5.5 pricing as flat and misses the documented long-context surcharge for >272K input-token sessions. (developers.openai.com)
Targeted smoke checks passed; pytest could not be run because pytest is not installed.

Methodology

No findings.

Affected methods: none. The diff is limited to reviewer tooling/docs/workflow/changelog files and does not change any causal estimator, mathematical formula, weighting scheme, variance/SE computation, inference field, or identification assumption. docs/methodology/REGISTRY.md therefore does not identify an affected method for cross-check.

Code Quality

Finding 1 — P2: GPT-5.5 API cost estimate misses long-context pricing

Location: .claude/scripts/openai_review.py:L871-L905, .claude/scripts/openai_review.py:L2015-L2016
Impact: The new gpt-5.5 standard pricing entries match OpenAI’s listed $5/$30 and gpt-5.5-pro $30/$180 rates, but gpt-5.5 has a documented long-context tier: prompts with >272K input tokens are priced at 2x input and 1.5x output for the full session. The flat (input, output) tuple can understate metered --backend api runs when users raise --token-budget, especially because the local docs still show a 500K-token-budget example. (developers.openai.com)
Concrete fix: Replace the flat tuple model for gpt-5.5 with pricing metadata that applies the long-context multipliers when input_tokens > 272_000, and add tests for below/above-threshold exact and snapshot model names.
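One way the suggested pricing metadata could look, as a sketch only. The `ModelPricing` type, its field names, and the `estimate_cost` signature are hypothetical; the rates, the 272K threshold, and the 2x / 1.5x multipliers are the ones quoted in this finding. Snapshot-name resolution is left out to keep the example short.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ModelPricing:
    """USD per 1M tokens plus an optional long-context tier (hypothetical type)."""

    input_rate: float
    output_rate: float
    long_context_threshold: Optional[int] = None  # input tokens; None means flat pricing
    long_input_multiplier: float = 1.0
    long_output_multiplier: float = 1.0


PRICING = {
    # Standard rates and long-context multipliers as quoted in the finding.
    "gpt-5.5": ModelPricing(5.00, 30.00, 272_000, 2.0, 1.5),
}


def estimate_cost(input_tokens: int, output_tokens: int, model: str) -> Optional[float]:
    """Return the estimated USD cost, or None when the model is unpriced."""
    pricing = PRICING.get(model)
    if pricing is None:
        return None
    input_rate, output_rate = pricing.input_rate, pricing.output_rate
    threshold = pricing.long_context_threshold
    if threshold is not None and input_tokens > threshold:
        # The surcharge covers the whole session, not only the tokens
        # beyond the threshold.
        input_rate *= pricing.long_input_multiplier
        output_rate *= pricing.long_output_multiplier
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000


if __name__ == "__main__":
    print(estimate_cost(272_000, 10_000, "gpt-5.5"))  # at the threshold: standard rates
    print(estimate_cost(300_000, 10_000, "gpt-5.5"))  # above it: long-context rates
```

The comparison is strict, so a session with exactly 272,000 input tokens still uses the standard rates, matching the ">272K" wording in the finding.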

Performance

No findings.

Maintainability

No findings. The model default, CI model, Codex command construction, timeout logic, and pricing table are narrowly scoped and consistent.

Tech Debt

Finding 2 — P3 informational: workflow contract test expansion remains tracked

Location: TODO.md:L177, TODO.md:L204
Impact: This PR changes the workflow model line, and broader workflow-contract pinning remains deferred. It is already tracked under “Tech Debt from Code Reviews,” so this is not a blocker.
Concrete fix: No PR-blocking action required; eventually implement the tracked workflow-contract test expansion.

Security

No findings. The workflow continues to use ${{ secrets.OPENAI_API_KEY }} and does not add new secret exposure.

Documentation/Tests

Finding 3 — P3: new gpt-5.5 paths lack regression tests

Location: tests/test_openai_review.py:L670-L685, tests/test_openai_review.py:L1545-L1618
Impact: Existing tests cover gpt-5.4 pricing, reasoning detection, and timeout behavior, but not the new default gpt-5.5 paths. A future regression could remove gpt-5.5 pricing or reasoning classification without a targeted failure.
Concrete fix: Add tests for DEFAULT_MODEL == "gpt-5.5", _is_reasoning_model("gpt-5.5"), _resolve_timeout(None, "gpt-5.5"), estimate_cost(..., "gpt-5.5"), and estimate_cost(..., "gpt-5.5-2026-04-23").

Verification performed:

git diff --check passed.
Direct import smoke check passed for default model, reasoning detection, timeout, exact/snapshot pricing, and pro snapshot pricing.
python -m pytest ... could not run: No module named pytest.

igerber added the ready-for-ci label (Triggers CI test workflows) Jun 1, 2026

igerber merged commit c7d2360 into main Jun 1, 2026
24 of 26 checks passed

igerber deleted the chore/codex-reviewer-gpt-5.5 branch June 1, 2026 19:15

igerber mentioned this pull request Jun 2, 2026

Bump version to 3.5.1 #526

Merged

igerber merged 2 commits into main from chore/codex-reviewer-gpt-5.5
